Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

نویسندگان

Yu-Gang Jiang

Zuxuan Wu

Jinhui Tang

Zechao Li

Xiangyang Xue

Shih-Fu Chang

چکیده

Videos are inherently multimodal. This paper studies the problem of how to fully exploit the abundant multimodal clues for improved video categorization. We introduce a hybrid deep learning framework that integrates useful clues from multiple modalities, including static spatial appearance information, motion patterns within a short time window, audio information as well as long-range temporal dynamics. More specifically, we utilize three Convolutional Neural Networks (CNNs) operating on appearance, motion and audio signals to extract their corresponding features. We then employ a feature fusion network to derive a unified representation with an aim to capture the relationships among features. Furthermore, to exploit the long-range temporal dynamics in videos, we apply two Long Short Term Memory networks with extracted appearance and motion features as inputs. Finally, we also propose to refine the prediction scores by leveraging contextual relationships among video semantics. The hybrid deep learning framework is able to exploit a comprehensive set of multimodal features for video classification. Through an extensive set of experiments, we demonstrate that (1) LSTM networks which model sequences in an explicitly recurrent manner are highly complementary with CNN models; (2) the feature fusion network which produces a fused representation through modeling feature relationships outperforms alternative fusion strategies; (3) the semantic context of video classes can help further refine the predictions for improved performance. Experimental results on two challenging benchmarks, the UCF-101 and the Columbia Consumer Videos (CCV), provide strong quantitative evidence that our framework achieves promising results: 93.1% on the UCF-101 and 84.5% on the CCV, outperforming competing methods with clear margins.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fusing Multi-Stream Deep Networks for Video Classification

This paper studies deep network architectures to address the problem of video classification. A multi-stream framework is proposed to fully utilize the rich multimodal information in videos. Specifically, we first train three Convolutional Neural Networks to model spatial, short-term motion and audio clues respectively. Long Short Term Memory networks are then adopted to explore long-term tempo...

متن کامل

Sequential Deep Learning for Disaster-Related Video Classification

Videos serve to convey complex semantic information and ease the understanding of new knowledge. However, when mixed semantic meanings from different modalities (i.e., image, video, text) are involved, it is more difficult for a computer model to detect and classify the concepts (such as flood, storm, and animals). This paper presents a multimodal deep learning framework to improve video concep...

متن کامل

A Hybrid Method for Traffic Flow Forecasting Using Multimodal Deep Learning

Traffic flow forecasting has been regarded as a key problem of intelligent transport systems. In this work, we propose a hybrid multimodal deep learning method for short-term traffic flow forecasting, which jointly learns the spatial-temporal correlation features and interdependence of multi-modality traffic data by multimodal deep learning architecture. According to the highly nonlinear charac...

متن کامل

Multimodal sparse representation learning and applications

Unsupervised methods have proven effective for discriminative tasks in a singlemodality scenario. In this paper, we present a multimodal framework for learning sparse representations that can capture semantic correlation between modalities. The framework can model relationships at a higher level by forcing the shared sparse representation. In particular, we propose the use of joint dictionary l...

متن کامل

A Hybrid Framework for Building an Efficient Incremental Intrusion Detection System

In this paper, a boosting-based incremental hybrid intrusion detection system is introduced. This system combines incremental misuse detection and incremental anomaly detection. We use boosting ensemble of weak classifiers to implement misuse intrusion detection system. It can identify new classes types of intrusions that do not exist in the training dataset for incremental misuse detection. As...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

CoRR

دوره abs/1706.04508 شماره

صفحات -

تاریخ انتشار 2017

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

نویسندگان

چکیده

منابع مشابه

Fusing Multi-Stream Deep Networks for Video Classification

Sequential Deep Learning for Disaster-Related Video Classification

A Hybrid Method for Traffic Flow Forecasting Using Multimodal Deep Learning

Multimodal sparse representation learning and applications

A Hybrid Framework for Building an Efficient Incremental Intrusion Detection System

عنوان ژورنال:

اشتراک گذاری